Abstract

Background: As high-throughput genomic technologies become accurate and affordable, an increasing number of 
data sets have been accumulated in the public domain and genomic information integration and meta-analysis 
have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple 
microarray studies with relevant biological hypotheses are combined in order to improve candidate marker 
detection. Many methods have been developed and applied in the literature, but their performance and properties 
have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice 
of a meta-analysis method given an application; the decision essentially requires both statistical and biological
considerations. 

Results: We applied 12 microarray meta-analysis methods to combine multiple simulated expression profiles;
these methods can be categorized by the hypothesis setting they target: (1) HS_A: DE genes with non-zero
effect sizes in all studies, (2) HS_B: DE genes with non-zero effect sizes in one or more studies and (3) HS_r: DE genes
with non-zero effect sizes in the "majority" of studies. We then performed a comprehensive comparative analysis through six
large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological
association, stability and robustness. We elucidated the hypothesis settings behind the methods and further applied
multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data
structure, respectively.

Conclusions: The aggregated results from the simulation study categorized the 12 methods into three hypothesis 
settings (HS_A, HS_B and HS_r). Evaluation in real data and results from MDS and entropy analyses provided an
insightful and practical guideline to the choice of the most suitable method in a given application. All source files 
for simulation and real data are available on the author's publication website. 



Background 

Microarray technology has been widely used to identify
differentially expressed (DE) genes in biomedical research
in the past decade. Many transcriptomic microarray 
studies have been generated and made available in public 
domains such as the Gene Expression Omnibus (GEO) 
from NCBI (http://www.ncbi.nlm.nih.gov/geo/) and Array 
Express from EBI (http://www.ebi.ac.uk/arrayexpress/). 
From the databases, one can easily obtain multiple 
studies of a relevant biological or disease hypothesis.
Since a single study often has small sample size and lim- 
ited statistical power, combining information across 
multiple studies is an intuitive way to increase sensitivity.
Ramasamy et al. proposed seven-step practical
guidelines for conducting microarray meta-analysis [1]:
"(i) identify suitable microarray studies; (ii) extract the 
data from studies; (iii) prepare the individual datasets; 
(iv) annotate the individual datasets; (v) resolve the 
many-to-many relationship between probes and genes; 
(vi) combine the study-specific estimates; (vii) analyze, 
present, and interpret results". In the first step, although
meta-analysis theoretically increases the statistical power
to detect DE genes, performance can deteriorate if
problematic or heterogeneous studies are combined. In
many applications, the data inclusion/exclusion criteria 
are based on ad-hoc expert opinions, a naive sample size 
threshold or selection of platforms without an objective 
quality control procedure. Kang et al. proposed six quantitative
quality control measures (MetaQC) to guide decisions on
study inclusion [2]. Steps (ii)-(v) relate to data preprocessing.
Finally, Steps (vi) and (vii) involve the selection of a
meta-analysis method and interpretation of the result and
are the foci of this paper.

Many microarray meta-analysis methods have been 
developed and applied in the literature. According to a 
recent review paper by Tseng et al. [3], popular methods 
mainly combine three different types of statistics:
combine p-values, combine effect sizes and combine ranks.
In this paper, we include 12 popular as well as state-of- 
the-art methods in the evaluation and comparison. Six 
methods (Fisher, Stouffer, adaptively weighted Fisher,
minimum p-value, maximum p-value and rth ordered
p-value) belong to the p-value combination category,
two methods (fixed effects model and random effects
model) belong to the effect size combination category
and four methods (RankProd, RankSum, product of
ranks and sum of ranks) belong to the rank combination
category. Details of these methods and citations
are provided in the Methods section. Despite the
availability of many methods, pros and cons of these 
methods and a comprehensive evaluation remain largely 
missing in the literature. To our knowledge, Hong and
Breitling [4] and Campain and Yang [5] are the only two
comparative studies that have systematically compared 
multiple meta-analysis methods. The number of 
methods compared (three and five methods, respect- 
ively) and the number of real examples examined (two 
and three examples respectively with each example cov- 
ering 2-5 microarray studies) were, however, limited. 
The conclusions of the two papers were only suggestive and
offered limited insight to guide practitioners. In addition, as
we will discuss in the Methods section, different meta-analysis
methods have different underlying hypothesis
setting targets. As a result, the selection of an adequate 
(or optimal) meta-analysis method depends heavily on 
the data structure and the hypothesis setting to achieve 
the underlying biological goal. 

In this paper, we compare 12 popular microarray meta- 
analysis methods using simulation and six real applica- 
tions to benchmark their performance by four statistical 
criteria (detection capability, biological association, stabil- 
ity and robustness). Using simulation, we will characterize 
the strength of each method under three different hypoth- 
esis settings (i.e. detect DE genes in "all studies", "majority 
of studies" or "one or more studies"; see the Methods section
for more details). We will compare the similarity and 
grouping of the meta-analysis methods based on their
DE gene detection results (by using a similarity measure
and multi-dimensional scaling plot) and use an entropy
measure to characterize the data structure to
determine which hypothesis setting may be more ad- 
equate in a given application. Finally, we give a guide- 
line to help practitioners select the best meta-analysis 
method under the choice of hypothesis setting in their 
applications. 

Methods 

Real data sets 

Six example data sets for microarray meta-analysis 
were collected for evaluations in this paper. Each ex- 
ample contained 4-8 microarray studies. Five of the 
six examples were of the commonly seen two-group 
comparison and the last breast cancer example con- 
tained relapse-free survival outcome. We applied the 
MetaQC package [2] to assess quality of the studies 
for meta-analysis and determined the final inclusion/ 
exclusion criteria. The principal component analysis 
(PCA) bi-plots and the six QC measures are summarized
in Additional file 1: Figure S1, Tables S2 and S3.
Details of the data sets are available in Additional file 1:
Table S1.

Underlying hypothesis settings 

Following the classical convention of Birnbaum [6] and
Li and Tseng [7] (see also Tseng et al. [3]), meta-analysis
methods can be classified into two complementary hypothesis
settings. In the first hypothesis setting (denoted
as HS_A), the goal is to detect DE genes that have non-zero
effect sizes in all studies:

$$H_0: \bigcap_{k=1}^{K} \{\theta_k = 0\} \quad \text{versus} \quad H_a: \bigcap_{k=1}^{K} \{\theta_k \neq 0\} \qquad (HS_A)$$

where $\theta_k$ is the effect size of study k. The second
hypothesis setting (denoted as HS_B), however, aims to
detect a DE gene if it has a non-zero effect size in "one
or more" studies:

$$H_0: \bigcap_{k=1}^{K} \{\theta_k = 0\} \quad \text{versus} \quad H_a: \bigcup_{k=1}^{K} \{\theta_k \neq 0\} \qquad (HS_B)$$

In most applications, HS_A is more appropriate to detect
conserved and consistent candidate markers across all
studies. However, different degrees of heterogeneity can
exist in the studies and HS_B can be useful to detect
study-specific markers (e.g. studies from different tissues are
combined and tissue-specific markers are expected and of
interest). Since HS_A is often too conservative when many
studies are combined, Song and Tseng (2012) proposed
a more practical and robust hypothesis setting (namely
HS_r) that targets DE genes with non-zero effect sizes
in the "majority" of studies, where the majority of studies is
defined as, for example, more than 50% of combined
studies (i.e. r > 0.5·K). The robust hypothesis setting
considered was:

$$H_0: \bigcap_{k=1}^{K} \{\theta_k = 0\} \quad \text{versus} \quad H_a: \sum_{k=1}^{K} I\{\theta_k \neq 0\} \geq r \qquad (HS_r)$$

A major contribution of this paper is to characterize 
meta-analysis methods suitable for different hypothesis 
settings (HS_A, HS_B and HS_r) using simulation and real
applications and to compare their performance with four 
benchmarks to provide a practical guideline. 

Microarray meta-analysis data pre-processing 

Assume that we have K microarray studies to combine.
For study k (1 ≤ k ≤ K), denote by $x_{gsk}$ the gene expression
intensity of gene g (1 ≤ g ≤ G) and sample s (1 ≤ s ≤ $S_k$; $S_k$
the number of samples in study k), and $y_{sk}$ the disease/outcome
variable of sample s. The disease/outcome variable
can be binary, multi-class, continuous or censored,
representing the disease state, severity or prognosis out- 
come (e.g. tumor versus normal or recurrence survival 
time). The goal of microarray meta-analysis is to combine 
information of K studies to detect differentially expressed 
(DE) genes associated with the disease/outcome vari- 
able. Such DE genes serve as candidate markers for dis- 
ease classification, diagnosis or prognosis prediction 
and help understand the genetic mechanisms underlying
a disease. In this paper, before meta-analysis we
first applied a penalized t-statistic to each individual study
to generate p-values or DE ranks [8] for a binary outcome.
In contrast to the traditional t-statistic, the penalized t-statistic
adds a fudge parameter $s_0$ to stabilize the denominator
($T = (\bar{X} - \bar{Y})/(s + s_0)$; $\bar{X}$ and $\bar{Y}$ are means of case and
control groups) and to avoid a large t-statistic due to a small
estimated variance s. The p-values were calculated using
the null distributions derived from conventional non-parametric
permutation analysis by randomly permuting
the case and control labels 10,000 times [9]. For censored
outcome variables, the Cox proportional hazards model
and log-rank test were used [10]. Meta-analysis methods
(described in the next subsection) were then used to
combine information across studies and generate meta-analyzed
p-values. To account for multiple comparisons,
the Benjamini and Hochberg procedure was used to control
the false discovery rate (FDR) [11]. All methods were implemented
using the "MetaDE" package in R [12]. Data sets
and all programming codes are available at
http://www.biostat.pitt.edu/bioinfo/publication.htm.
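
As a rough illustration of the per-study step described above, the following Python sketch computes a penalized t-statistic with a fudge parameter and applies the Benjamini and Hochberg adjustment. The function names, the choice of s as the pooled standard error of the mean difference, and the argument layout are assumptions made for illustration only; the analysis in this paper used the MetaDE package in R with permutation-based p-values, which this sketch does not reproduce.

```python
import math

def penalized_t(case, control, s0):
    """Penalized t-statistic T = (mean(case) - mean(control)) / (s + s0).

    Assumption: s is the pooled standard error of the mean difference.
    s0 is the fudge parameter that stabilizes the denominator.
    """
    n1, n2 = len(case), len(control)
    m1, m2 = sum(case) / n1, sum(control) / n2
    ss1 = sum((x - m1) ** 2 for x in case)
    ss2 = sum((y - m2) ** 2 for y in control)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / (s + s0)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A larger s0 shrinks the statistic toward zero, which is the intended protection against genes whose estimated variance happens to be very small.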

Microarray meta-analysis methods 

According to a recent review paper [3], microarray meta-analysis
methods can be categorized into three types:
combine p-values, combine effect sizes and combine ranks.
Below, we briefly describe 12 methods that were selected 
for comparison. 



Combine p-values 

Fisher Fisher's method [13] sums up the log-transformed
p-values obtained from individual studies.
The combined Fisher's statistic $T_{Fisher} = -2\sum_{k=1}^{K}\log(p_k)$
follows a $\chi^2$ distribution with 2K degrees of freedom
under the null hypothesis (assuming null p-values are
uniformly distributed). Note that we perform permutation
analysis instead of such parametric evaluation for
Fisher and other methods in this paper. Smaller p-values
contribute larger scores to Fisher's statistic.
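
The parametric version of Fisher's method can be sketched in a few lines of Python. This is only an illustration of the chi-square reference distribution mentioned above, not the permutation analysis used in this paper; the function names are hypothetical, and the closed-form tail probability relies on the degrees of freedom being even.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even df.

    For df = 2K the tail has the closed form
    exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(pvalues):
    """Fisher's statistic -2 * sum(log p_k) and its parametric p-value.

    All p-values must be strictly positive.
    """
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return stat, chi2_sf_even_df(stat, 2 * len(pvalues))
```

With a single study the combined p-value reduces to the input p-value, which is a convenient sanity check.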

Stouffer Stouffer's method [14] sums the inverse normal
transformed p-values. Stouffer's statistic
$T_{Stouffer} = \sum_{k=1}^{K} z_k/\sqrt{K}$ ($z_k = \Phi^{-1}(p_k)$, where $\Phi$ is the standard normal
c.d.f.) follows a standard normal distribution under the
null hypothesis. Similar to Fisher's method, smaller
p-values contribute more to Stouffer's score, but to a
lesser degree.
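
A minimal Python sketch of Stouffer's statistic follows. The function name is hypothetical, and the one-sided combined p-value is an assumption that follows from the sign convention in the formula above (small p-values give negative z-scores); the paper itself evaluated significance by permutation.

```python
import math
from statistics import NormalDist

def stouffer_combine(pvalues):
    """Stouffer's statistic sum(z_k) / sqrt(K) with z_k = Phi^{-1}(p_k).

    With this sign convention small p-values give a large negative
    statistic, so the combined one-sided p-value is Phi(T).
    All p-values must lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    z = [std_normal.inv_cdf(p) for p in pvalues]
    stat = sum(z) / math.sqrt(len(z))
    return stat, std_normal.cdf(stat)
```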

Adaptively weighted (AW) Fisher The AW Fisher's
method [7] assigns different weights to each individual
study, $T_{AW} = \sum_{k=1}^{K} w_k \cdot \log(p_k)$, $w_k = 0$ or $1$, and it
searches through all possible weights to find the best
adaptive weight with the smallest derived p-value. One significant
advantage of this method is its ability to indicate
which studies contribute to the evidence aggregation and
to elucidate heterogeneity in the meta-analysis. Details are
given in Additional file 1.
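
The weight search can be illustrated with the sketch below. It is a simplified stand-in rather than the published algorithm: each candidate 0/1 weight vector is scored with a Fisher-type statistic -2 * sum(w_k * log p_k) referred to a chi-square distribution (an assumption), whereas the actual method derives p-values by permutation, and the smallest p-value found here would itself need its own null distribution before being reported. All names are hypothetical.

```python
import math
from itertools import product

def _chi2_sf_even_df(x, df):
    """Upper tail of a chi-square variable with even df (closed form)."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def aw_fisher_weights(pvalues):
    """Search all non-zero 0/1 weight vectors.

    Returns (best_weights, smallest_pvalue). The weights show which
    studies contribute to the evidence for this gene.
    """
    best_w, best_p = None, 2.0
    for w in product((0, 1), repeat=len(pvalues)):
        n_used = sum(w)
        if n_used == 0:
            continue
        stat = -2.0 * sum(wk * math.log(p) for wk, p in zip(w, pvalues))
        p_w = _chi2_sf_even_df(stat, 2 * n_used)
        if p_w < best_p:
            best_w, best_p = w, p_w
    return best_w, best_p
```

For a gene with p-values 0.001, 0.8 and 0.002, the search selects the first and third studies and drops the uninformative second one.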

Minimum p-value (minP) The minP method takes the
minimum p-value among the K studies as the test statistic
[15]. It follows a beta distribution with degrees of freedom
α = 1 and β = K under the null hypothesis. This method
detects a DE gene whenever a small p-value exists in
any one of the K studies.

Maximum p-value (maxP) The maxP method takes the
maximum p-value as the test statistic [16]. It follows a beta
distribution with degrees of freedom α = K and β = 1 under
the null hypothesis. This method targets DE genes that
have small p-values in "all" studies.

r-th ordered p-value (rOP) The rOP method takes the
r-th order statistic among the sorted p-values of the K combined
studies. Under the null hypothesis, the statistic
follows a beta distribution with degrees of freedom α = r
and β = K − r + 1. The minP and maxP methods are special
cases of rOP. In Song and Tseng [17], rOP is considered
a robust form of maxP (where r is set as greater
than 0.5·K) to identify candidate markers differentially
expressed in the "majority" of studies.
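
The beta reference distribution shared by minP, maxP and rOP can be evaluated without special functions, because for integer parameters the beta c.d.f. equals a binomial tail probability. The sketch below uses that identity; the function name is hypothetical and independence of the K studies is assumed.

```python
from math import comb

def rop_pvalue(pvalues, r):
    """P-value of the r-th smallest p-value among K independent studies.

    Under the null the r-th order statistic of K uniform p-values is
    Beta(r, K - r + 1); its c.d.f. at x equals P(Binomial(K, x) >= r).
    r = 1 gives minP and r = K gives maxP.
    """
    k = len(pvalues)
    if not 1 <= r <= k:
        raise ValueError("r must satisfy 1 <= r <= K")
    x = sorted(pvalues)[r - 1]
    return sum(comb(k, j) * x**j * (1.0 - x) ** (k - j) for j in range(r, k + 1))
```

For example, with p-values 0.01, 0.2, 0.03 and 0.5, calling `rop_pvalue(p, 1)` gives the minP result 1 − 0.99⁴ and `rop_pvalue(p, 4)` gives the maxP result 0.5⁴.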
Combine effect size

Fixed effects model (FEM) FEM combines the effect 
size across K studies by assuming a simple linear model 
with an underlying true effect size plus a random error in 
each study. 

Random effects model (REM) REM [18] extends FEM 
by allowing random effects for the inter-study heterogen- 
eity in the model. Detailed formulation and inference of 
FEM and REM are available in the Additional file 1. 

Combine rank statistics 

RankProd (RP) and RankSum (RS) RankProd and 
RankSum are based on the common biological belief 
that if a gene is repeatedly at the top of the lists ordered 
by up- or down-regulation fold change in replicate experi- 
ments, the gene is more likely a DE gene [19]. Detailed 
formulation and algorithms are available in the Additional 
file 1. 

Product of ranks (PR) and Sum of ranks (SR) These
two methods apply a naive product or sum of the DE
evidence ranks across studies [20]. Suppose $R_{gk}$ represents
the rank of the p-value of gene g among all genes in
study k. The test statistics of the PR and SR methods are calculated
as $PR_g = \prod_{k=1}^{K} R_{gk}$ and $SR_g = \sum_{k=1}^{K} R_{gk}$, respectively.
P-values of the test statistics can be calculated
analytically or obtained from a permutation analysis. Note
that ranking from the smallest to the largest p-value (the
choice in the method) is more sensitive than ranking
from largest to smallest in the PR method, while it makes
no difference to SR.
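
The two rank statistics are simple enough to sketch directly. In the Python illustration below, the input layout (one list of p-values per study) and the tie handling (ties broken by input order) are assumptions; the p-values of the resulting statistics would still have to be obtained analytically or by permutation as described above.

```python
import math

def rank_ascending(values):
    """Rank values from smallest (rank 1) to largest; ties broken by order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def product_and_sum_of_ranks(pvalue_matrix):
    """pvalue_matrix[k][g] is the p-value of gene g in study k.

    Returns (PR, SR): per-gene product and sum of within-study ranks.
    """
    ranks = [rank_ascending(study) for study in pvalue_matrix]
    n_genes = len(pvalue_matrix[0])
    pr = [math.prod(r[g] for r in ranks) for g in range(n_genes)]
    sr = [sum(r[g] for r in ranks) for g in range(n_genes)]
    return pr, sr
```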

Characterization of meta-analysis methods 
MDS plots to characterize the methods 

The multi-dimensional scaling (MDS) plot is a useful 
visualization tool for exploring high-dimensional data 
in a low-dimensional space [21]. In the evaluation of 12 
meta-analysis methods, we calculated the adjusted DE 
similarity measure for every pair of methods to quantify 
the similarity of their DE analysis results in a given 
example. A dissimilarity measure is then defined as one 
minus the adjusted DE similarity measure and the 
dissimilarity measure is used to generate an MDS plot 
of the 12 methods. In the MDS plot, methods that are 
clustered in a neighborhood indicate that they produce 
similar DE analysis results. 

Entropy measure to characterize data sets 

As indicated in the "Underlying hypothesis
settings" section, selection of the most suitable meta-analysis
method(s) largely depends on their underlying hypothesis
setting (HS_A, HS_B and HS_r). The selection of a hypothesis
setting for a given application should be based on the
experimental design, biological knowledge and the
associated analytical objectives. There are, however,
occasions when little prior knowledge or preference is
available and an objective characterization of the data
structure is desired in a given application. For this purpose,
we developed a data-driven entropy measure to
characterize whether a given meta-analysis data set contains
more HS_A-type markers or HS_B-type markers [22].
The algorithm is described below:

1. Apply Fisher's meta-analysis method to combine
p-values across studies to identify the top
H candidate markers. Here we used H = 1,000;
H represents the rough number of DE genes (in our
belief) that are contained in the data.

2. For each selected marker, the standardized minus log
p-value score for gene g in the k-th study is defined
as $l_{gk} = -\log(p_{gk}) / \left(-\sum_{k'=1}^{K} \log(p_{gk'})\right)$. Note that
$0 \leq l_{gk} \leq 1$, a large $l_{gk}$ corresponds to a more significant
p-value $p_{gk}$, and $\sum_{k=1}^{K} l_{gk} = 1$.

3. The entropy of gene g is defined as
$e_g = -\sum_{k=1}^{K} l_{gk} \log(l_{gk})$. Box-plots of entropies of the top H genes
are generated for each meta-analysis application
(Figure 1(b)).

Intuitively, a high entropy value indicates that the gene
has small p-values in all or most studies and is of HS_A
or HS_r-type. Conversely, genes with small entropy have
small p-values in one or only a few studies, where HS_B-type
methods are more adequate. When calculating $l_{gk}$ in
step 2, we capped $-\log(p_{gk})$ at 10 to avoid contributions
of close-to-zero p-values that can generate near-infinite
scores. The entropy box-plot helps determine an appropriate
meta-analysis hypothesis setting if no pre-set biological
objective exists.
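
Steps 2 and 3 of the entropy algorithm, together with the cap described above, can be sketched for a single gene as follows. The function name is hypothetical, the natural logarithm is assumed (the text does not state the base), and the convention that a zero score contributes nothing to the entropy is an assumption as well.

```python
import math

def gene_entropy(pvalues, cap=10.0):
    """Entropy of the standardized -log p-value scores of one gene.

    Each -log(p_k) is capped at `cap` (10 in the text) before being
    normalized to sum to one across the K studies.
    """
    scores = [min(-math.log(p), cap) for p in pvalues]
    total = sum(scores)
    if total == 0.0:
        # All p-values equal 1: no evidence in any study.
        return 0.0
    weights = [s / total for s in scores]
    # Terms with zero weight contribute nothing (0 * log 0 := 0).
    return sum(-w * math.log(w) for w in weights if w > 0.0)
```

A gene with equal evidence in all K studies reaches the maximum entropy log(K), while a gene that is significant in only one study has entropy close to 0.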

Evaluation criteria 

For objective quantitative evaluation, we developed the fol- 
lowing four statistical criteria to benchmark performance 
of the methods. 

Detection capability 

The first criterion considers the number of DE genes 
detected by each meta-analysis method under the 
same pre-set FDR threshold (e.g. FDR = 1%). Although 
detecting more DE genes does not guarantee better 
"statistical power", this criterion has served as a surro- 
gate of statistical power in previous comparative studies 
[23]. Since we do not know the underlying true DE 
genes, we refer to this evaluation as "detection capability" 
in this paper. An implicit assumption underlying this
criterion is that the statistical procedure to detect DE
genes in each study and the FDR control in the meta-analysis
are accurate (or roughly accurate). To account
for data variability in the evaluation, we bootstrapped
(i.e. sampled with replacement to obtain the same number
of samples in each bootstrapped dataset) the samples
in each study B = 50 times and show the plots of the
mean with standard error bars. In the bootstrapping, the
entire sample is either selected or not, so the gene
dependence structure is maintained. Denote by $r_{meb}$
the rank of detection capability performance (the
smaller the better) of method m (1 ≤ m ≤ 12) in example
e (1 ≤ e ≤ 6) and in the b-th (1 ≤ b ≤ B) bootstrap
simulation. The mean standardized rank (MSR) for
method m and example e is calculated as
$MSR_{me} = \sum_{b=1}^{B} (r_{meb} / \#\text{ of methods compared}) / B$
and the aggregated standardized rank (ASR) is calculated as
$ASR_m = \sum_{e=1}^{6} MSR_{me} / 6$, representing the overall performance of
method m across all six examples. Additional file 1: Table
S4 shows the MSR and ASR of all 12 methods and
Figure 2 (in the Results section) shows the plot of the mean
with standard error bars for each method ordered by
ASR. We note that MSR and ASR are both standardized
between 0 and 1. The standardization in MSR is
necessary because in the breast cancer survival example
we cannot apply FEM, REM, RankSum and RankProd,
as they are developed only for a two-group comparison.

Figure 1 Characterization of methods and datasets. (a) Multi-dimensional scaling (MDS) plot of all 12 methods based on the average
dissimilarity matrix of six examples and (b) the box-plots of entropies in six data sets. Colors (red, green and blue) indicate clusters of methods
with similar DE detection ordering. High entropies indicate high consistency of DE gene detection across studies (e.g. MDD). Low entropies
show greater heterogeneity in DE gene detection (e.g. prostate cancer).
Biological association 

The second criterion requires that a good meta-analysis 
method should detect a DE gene list that has better associ- 
ation with pre-defined "gold standard" pathways related to 
the targeted disease. Such a "gold standard" pathway set 
should be obtained from biological knowledge for a 
given disease or biological mechanism under investigation. 
However, since most disease or biological mechanisms are 
not well-studied, obtaining such "gold standard" pathways 
is either difficult or questionable. To facilitate this evalu- 
ation without bias, we develop a computational and data- 
driven approach to determine a set of surrogate disease- 
related pathways out of a large collection of pathways by 
combining pathway enrichment analysis results from each 
single study. Specifically, we first collected 2,287 pathways
(gene sets) from MSigDB (http://www.broadinstitute.org/gsea/msigdb/):
1,454 pathways from "GO", 186 pathways
from "KEGG", 217 pathways from "BIOCARTA" and
430 pathways from "REACTOME". We filtered
out pathways with fewer than 5 genes or more than
200 genes, and 2,113 pathways were left for the analysis.
DE analysis was performed in each single study separately
and pathway enrichment analysis was performed
for all the 2,113 pathways by the Kolmogorov-Smirnov
(KS) association test. Denote by $p_{uk}$ the resulting pathway
enrichment p-value from the KS test for pathway u
(1 ≤ u ≤ 2,113) and study k (1 ≤ k ≤ K). For a given study
k, enrichment ranks over pathways were calculated as
$r_{uk} = rank_u(p_{uk})$. A rank-sum score for a given pathway
u was then derived as $S_u = \sum_{k=1}^{K} r_{uk}$. Intuitively, pathways
with small rank-sum scores are
likely associated with the disease outcome by aggregated
evidence of the K individual study analyses. We chose
the top |D| pathways that had the smallest rank-sum
scores as the surrogate disease-related pathways and
used these to proceed with the biological association
evaluation of meta-analysis methods in the following.

Figure 2 The plot of mean numbers of detected DE genes with error bars of standard error from 50 bootstrapped data sets for the 12
meta-analysis methods. Note that FEM, REM, RankProd and RankSum cannot be applied to survival examples.

Given the selected surrogate pathways D, the following
procedure was used to evaluate performance of the 12
meta-analysis methods for a given example e (1 ≤ e ≤ 6).
For each meta-analysis method m (1 ≤ m ≤ M = 12), the
DE analysis result was associated with surrogate pathway d and
the resulting enrichment p-value by KS test was denoted
by $P_{med}$ (1 ≤ d ≤ |D|). The rank of $P_{med}$ for method
m among the 12 methods was denoted by $v_{med} = rank_m(P_{med})$.
Similar to the detection capability evaluation,
we calculated the mean standardized rank (MSR)
for method m and example e as
$MSR_{me} = \sum_{d=1}^{|D|} (v_{med} / \#\text{ of methods compared}) / |D|$
and the aggregated standardized rank (ASR)
as $ASR_m = \sum_{e=1}^{6} MSR_{me} / 6$, representing
the overall performance of method m. To select the parameter
|D| for surrogate disease-related pathways, Additional
file 1: Figure S4 shows the trend of $MSR_{me}$ (on the y-axis)
versus |D| (on the x-axis) as |D| increases. The result indicated
that the performance evaluation using different |D|
only minimally impacted the conclusion when |D| > 30. We
chose |D| = 100 throughout this paper.

Note that we used the KS test instead of the popular Fisher's
exact test because each single study detected a variable
number of DE genes under a given FDR cutoff and
Fisher's exact test is usually not powerful unless a few
hundred DE genes are detected. On the other hand, the
KS test does not require an arbitrary p-value cutoff to
determine the DE gene list for enrichment analysis.

Stability 

The third criterion examines whether a meta-analysis
method generates stable DE analysis results. To achieve
this goal, we randomly split the samples in each
study into halves (so that cases and controls are as equally split as
possible). The first half of each study was taken to perform
the first meta-analysis and generate a DE analysis
result. Similarly, the second half of each study was taken
to perform a second meta-analysis. The generated DE
analysis results from two separate meta-analyses were
compared by the adjusted DE similarity measure (to be
described in the next section). The procedure was repeated
B = 50 times. Denote by $S_{meb}$ the adjusted DE similarity
measure of method m in the b-th simulation in example e.
Similar to the first two criteria, MSR and ASR were
calculated based on $S_{meb}$ to evaluate the methods.

Robustness 

The final criterion investigates the robustness of a meta-analysis
method when an outlying irrelevant study is mistakenly
added to the meta-analysis. For each of the six real
examples, we randomly picked one irrelevant study from
the other five examples, added it to the specific example 
for meta-analysis and evaluated the change from the ori- 
ginal meta-analysis. The adjusted DE similarity measure 
was calculated between the original meta-analysis and the 
new meta-analysis with an added outlier. A high adjusted 
DE similarity measure shows better robustness against in- 
clusion of the outlying study. This procedure was repeated 
until all irrelevant studies were used. The MSR and ASR 
are then calculated based on the adjusted DE similarity 
measures to evaluate the methods. 

Similarity measure between two ordered DE gene lists 

To compare results of two DE detection methods (from 
single study analysis or meta-analysis), a commonly used 
method in the literature is to take the DE genes under 
certain p-value or FDR threshold, plot the Venn diagram 
and compute the ratio of overlap. This method, however, 
greatly depends on the selection of FDR threshold and is 
unstable. Another approach is to take the generated DE 
ordered gene lists from two methods and compute the 
non-parametric Spearman rank correlation [24]. This 
method avoids the arbitrary FDR cutoff but gives, say, 
the top 100 important DE genes and the bottom 100 
non-DE genes equal contribution. To circumvent this 
pitfall, Li et al. proposed a parametric reproducibility 
measure for ChIP-seq data in the ENCODE project [25].
Yang et al. introduced an OrderedList measure to quantify 
similarity of two ordered DE gene lists [26]. For simplicity, 
we extended the OrderedList measure into a standardized 
similarity score for the evaluation purpose in this paper. 
Specifically, suppose $G_1$ and $G_2$ are two ordered DE gene
lists (e.g. ordered by p-values) and small ranks represent
more significant DE genes. We denote by $O_n(G_1, G_2)$ the
number of overlapped genes in the top n genes of $G_1$ and
$G_2$. As a result, $0 \leq O_n(G_1, G_2) \leq n$ and a large $O_n(G_1, G_2)$
value indicates high similarity of the two ordered lists in
the top n genes. A weighted average similarity score is calculated
as $S(G_1, G_2) = \sum_{n=1}^{G} e^{-\alpha n} \cdot O_n(G_1, G_2)$, where G
is the total number of matched genes and the power α
controls the magnitude of weights emphasized on the
top ranked genes. When α is large, top ranked genes
are weighted higher in the similarity measure. The expected
value (under the null hypothesis that the two
gene rankings are randomly generated) and maximum
value of S can be easily calculated:
$E_{null}(S(G_1, G_2)) = \sum_{n=1}^{G} e^{-\alpha n} \cdot n^2/G$ and
$\max(S(G_1, G_2)) = \sum_{n=1}^{G} e^{-\alpha n} \cdot n$.
We apply an idea similar to the adjusted Rand index [27], used
to measure similarity of two clustering results, and define
the adjusted DE similarity measure as

$$S^*(G_1, G_2) = \frac{S(G_1, G_2) - E_{null}(S(G_1, G_2))}{\max(S(G_1, G_2)) - E_{null}(S(G_1, G_2))}$$

This measure ranges between -1 and 1 and gives an expected
value of 0 if two ordered gene lists are obtained
by random chance. Yang et al. proposed resampling-based
and ROC methods to estimate the best selection
of α. Since the number of DE genes in our examples is
generally high, we chose a relatively small α = 0.001
throughout this paper. We have tested different α and
found that the results were similar (Additional file 1:
Figure S7).
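
The adjusted DE similarity measure can be computed in a single pass over the two ordered lists, as in the Python sketch below. The function name and the requirement that both lists are orderings of the same set of at least two unique genes are assumptions made for this illustration; the top-n overlap is updated incrementally rather than recomputed for every n.

```python
import math

def adjusted_de_similarity(list1, list2, alpha=0.001):
    """Adjusted similarity of two ordered gene lists over the same genes.

    S = sum_n exp(-alpha * n) * O_n, where O_n is the overlap of the two
    top-n sets; S is centred by its null expectation and scaled by its
    maximum, in the spirit of the adjusted Rand index.
    """
    if set(list1) != set(list2) or len(list1) != len(set(list1)):
        raise ValueError("lists must be orderings of the same unique genes")
    g = len(list1)
    if g < 2:
        raise ValueError("at least two genes are required")
    seen1, seen2 = set(), set()
    overlap = 0
    score = expected = maximum = 0.0
    for n in range(1, g + 1):
        a, b = list1[n - 1], list2[n - 1]
        # Update the top-n overlap incrementally.
        if a == b:
            overlap += 1
        else:
            if a in seen2:
                overlap += 1
            if b in seen1:
                overlap += 1
        seen1.add(a)
        seen2.add(b)
        weight = math.exp(-alpha * n)
        score += weight * overlap
        expected += weight * n * n / g
        maximum += weight * n
    return (score - expected) / (maximum - expected)
```

Two identical orderings give a value of 1, and a fully reversed ordering gives a negative value.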

Results 

Simulation setting 

We conducted simulation studies to evaluate and characterize
the 12 meta-analysis methods for detecting biomarkers
in the underlying hypothesis settings of HS_A,
HS_B or HS_r. The simulation algorithm is described below:

1. We simulated 800 genes with 40 gene clusters
(20 genes in each cluster) and another 1,200 genes that
do not belong to any cluster. The cluster indexes $C_g$
for gene g (1 ≤ g ≤ 2,000) were randomly sampled,
such that $\sum I\{C_g = 0\}$ = 1,200 and $\sum I\{C_g = c\}$ = 20,
1 ≤ c ≤ 40.

2. For genes in cluster c (1 ≤ c ≤ 40) and in study
k (1 ≤ k ≤ 5), we sampled $\Sigma'_{ck} \sim W^{-1}(\Psi, 60)$, where
$\Psi = 0.5 I_{20 \times 20} + 0.5 J_{20 \times 20}$, $W^{-1}$ denotes the inverse
Wishart distribution, I is the identity matrix and J
is the matrix with all elements equal to 1. We then
standardized $\Sigma'_{ck}$ into $\Sigma_{ck}$, where the diagonal
elements are all 1's.

3. The 20 genes in cluster c were denoted by the indexes
$g_{c1}, \ldots, g_{c20}$, i.e. $C_{g_{cj}} = c$, where 1 ≤ c ≤ 40 and 1 ≤ j ≤ 20.
We sampled gene expression levels of genes in
cluster c for sample n as $(X'_{g_{c1}nk}, \ldots, X'_{g_{c20}nk})^T \sim MVN(0, \Sigma_{ck})$,
where 1 ≤ n ≤ 100 and 1 ≤ k ≤ 5, and sampled the
expression level for a gene g that is not in any cluster
for sample n as $X'_{gnk} \sim N(0, \sigma_k^2)$, where 1 ≤ n ≤ 100,
1 ≤ k ≤ 5 and $\sigma_k^2$ was uniformly distributed on
[0.8, 1.2], which indicates a different variance for
study k.

4. For the first 1,000 genes (1 ≤ g ≤ 1,000), $k_g$ (the
number of studies in which gene g is differentially
expressed) was generated by sampling $k_g$ = 1, 2,
3, 4 and 5, respectively. For the next 1,000 genes
(1,001 ≤ g ≤ 2,000), $k_g$ = 0 represents non-DE genes
in all five studies.

5. To simulate expression intensities for cases, we
randomly sampled $\delta_{gk} \in \{0, 1\}$, such that $\sum_k \delta_{gk} = k_g$.
If $\delta_{gk} = 1$, gene g in study k was a DE gene, otherwise
it was a non-DE gene. When $\delta_{gk} = 1$, we sampled
expression intensities $\mu_{gk}$ from a uniform distribution
in the range of [0.5, 3], which means we considered
the concordance effect (up-regulated) among all
simulated studies. Hence, the expression for control
samples is $X_{gnk} = X'_{gnk}$, and for case samples is
$Y_{gnk} = X'_{g(n+50)k} + \mu_{gk} \cdot \delta_{gk}$, for 1 ≤ g ≤ 2,000,
1 ≤ n ≤ 50 and 1 ≤ k ≤ 5.

Table 1 The detected number of DE genes (at FDR = 5%),
the true FDR, AUC values under HS_A and HS_B and the
concluding characterization of targeted hypothesis
setting of each method (standard errors in parentheses)

                   maxP           rOP            minP           Fisher         AW             Stouffer
Detected #         321 (2.2)      522 (2.35)     1005 (0.85)    1000 (1.06)    1000 (1.05)    974 (1.5)
True FDR (HS_A)    .068 (.0008)   .018 (.0012)   .447 (.0006)   .444 (.0007)   .444 (.0008)   .43 (.0009)
True FDR (HS_B)    .007 (.0005)   .011 (.0004)   .016 (.0006)   .017 (.0006)   .016 (.0007)   .022 (.0006)
AUC (HS_A)         .996 (.0003)   .964 (.0014)   .8 (.0005)     .82 (.0005)    .79 (.0005)    .89 (.0006)
AUC (HS_B)         .75 (.0013)    .833 (.01)     .99 (.0001)    .99 (.0001)    .99 (.0001)    .99 (.0005)
Characterization   HS_A           HS_r           HS_B           HS_B           HS_B           HS_B

                   PR             SR             FEM            REM            RankProd       RankSum
Detected #         136 (2.51)     186 (2.3)      948 (1.75)     411 (2.86)     391 (3.31)     105 (1.514)
True FDR (HS_A)    .008 (.0003)   .01 (.0004)    .415 (.0009)   .117 (.0015)   .13 (.0014)    .389 (.0008)
True FDR (HS_B)    0 (0)          0 (0)          .022 (.0007)   .007 (.0004)   0 (0)          0 (0)
AUC (HS_A)         .986 (.0003)   .99 (.0002)    .917 (.0009)   .99 (.0002)    .916 (.0011)   .504 (.0046)
AUC (HS_B)         .981 (.0004)   .95 (.0008)    .984 (.0004)   .92 (.0011)    .934 (.0012)   .496 (.0025)
Characterization   HS_A           HS_A           HS_B           HS_r           HS_B           HS_B

In the simulation study, we had 1,000 non-DE genes in all five studies (k_g = 0), and 1,000 genes were differentially expressed in one to five studies (k_g = 1, 2, 3, 4, 5). On average, we had roughly the same number (~200) of genes in each group of k_g = 1, 2, 3, 4, 5. See Additional file 1: Figure S2 for the heatmap of a simulated example (red colour represents up-regulated genes). We applied the 12 meta-analysis methods under FDR control at 5%. With the knowledge of the true k_g, we were able to derive the sensitivity and specificity for HS_A and HS_B, respectively. In HS_A, genes with k_g = 5 were the underlying true positives and genes with k_g = 0 to 4 were the underlying true negatives; in HS_B, genes with k_g = 1 to 5 were the underlying true positives and genes with k_g = 0 were the true negatives. By adjusting the decision cut-off, the receiver operating characteristic (ROC) curves and the resulting area under the curve (AUC) were used to evaluate the performance. We simulated 50 data sets and reported the means and standard errors of the AUC values. AUC values range between 0 and 1: AUC = 0.5 corresponds to a random guess and AUC = 1 to perfect prediction. The above simulation scheme only considered concordant effect sizes (i.e. all up-regulated when a gene is DE in a study) among the five simulated studies. In many applications, some genes may reach p-value statistical significance in the meta-analysis although their effect sizes are discordant (i.e. a gene is up-regulated in one study but down-regulated in another). To investigate that effect, we performed a second simulation that considers random discordant cases. In step 5, $\mu_{gk}$ became a mixture of two uniform distributions: $\mu_{gk} \sim \pi_{gk} \cdot \mathrm{Unif}[-3, -0.5] + (1 - \pi_{gk}) \cdot \mathrm{Unif}[0.5, 3]$, where $\pi_{gk}$ is the probability that gene $g$ ($1 \le g \le 2{,}000$) in study $k$ ($1 \le k \le 5$) has a discordant (down-regulated) effect size. We set $\pi_{gk} = 0.2$ for the discordant simulation setting.
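The evaluation described above, in which the same gene scores are judged against two different sets of true positives, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the AUC is computed through the rank-sum (Mann-Whitney) identity, and the function names `auc` and `truth_labels` are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: larger value = stronger evidence that the gene is DE.
    labels: 1 for true positives, 0 for true negatives (both must occur).
    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank over the tie block
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def truth_labels(k_g, n_studies=5):
    """True-positive indicators under HS_A (DE in all studies)
    and HS_B (DE in one or more studies), given the true k_g."""
    hs_a = [1 if k == n_studies else 0 for k in k_g]
    hs_b = [1 if k >= 1 else 0 for k in k_g]
    return hs_a, hs_b


if __name__ == "__main__":
    k_g = [0, 0, 1, 3, 5, 5]
    scores = [0.1, 0.7, 0.5, 0.6, 0.9, 0.3]  # e.g. -log10 of a meta-analysis p-value
    hs_a, hs_b = truth_labels(k_g)
    print(auc(scores, hs_a), auc(scores, hs_b))
```

Only the labels change between the two settings, so a method can rank genes well under HS_B and poorly under HS_A with the same scores.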

Simulation results to characterize the methods 

The simulation study provided the underlying truth to characterize the meta-analysis methods according to their strengths and weaknesses for detecting DE genes under different hypothesis settings. The performance of the 12 methods was evaluated by receiver operating characteristic (ROC) curves, a visualization tool that illustrates the trade-off between sensitivity and specificity, and by the resulting area under the ROC curve (AUC) under the two hypothesis settings HS_A and HS_B. Table 1 shows the number of DE genes detected under a nominal FDR of 5%, the true FDR and the AUC values under HS_A and HS_B for all 12 methods. The values were averaged over 50 simulations and the standard errors are shown in parentheses.

Figure 3 shows the histogram of the true number of DE studies (i.e. k_g) among the detected DE genes under FDR = 5% for each method. It is clearly seen that minP, Fisher, AW, Stouffer and FEM detected HS_B-type DE genes and had high AUC values under the HS_B criterion (0.98-0.99), compared to lower AUC values under the HS_A criterion (0.79-0.9). For these methods, the true FDR for HS_A was generally out of control (0.41-0.44). On the other hand, maxP, rOP and REM had high AUC under the HS_A criterion (0.96-0.99) (true FDR = 0.068-0.117) compared to HS_B (0.75-0.92). maxP detected mostly HS_A-type markers, while rOP and REM detected mostly HS_r-type DE genes. PR and SR detected mostly HS_A-type DE genes but, surprisingly, had very high AUC under both the HS_A and HS_B criteria. The RankProd method detected DE genes between the HS_r and HS_B types and had a good AUC value under HS_B. RankSum detected HS_B-type DE genes but had poor AUC values (0.5) under both HS_A and HS_B. Table 1 includes our concluding characterization of the targeted hypothesis settings for each meta-analysis method (see also Additional file 1: Figure S5 for the ROC curves and AUC values under HS_A and HS_B for the 12 meta-analysis methods). Additional file 1: Figure S3 shows the result for the second, discordant simulation setting. The numbers of studies with opposite effect size are represented by different colours in the histogram plot (green: all studies with concordant effect size; blue: one study has an effect size opposite to the remaining studies; red: two studies have effect sizes opposite to the remaining studies). In summary, almost none of the meta-analysis methods could avoid including genes with opposite effect sizes. In particular, methods utilizing p-values from two-sided tests (e.g. Fisher, AW, minP, maxP and rOP) could not distinguish the direction of effect sizes. Stouffer was the only method that accommodated the effect size direction in its z-transformation formulation, but its ability to avoid DE genes with discordant effect sizes still seemed limited. Owen (2009) proposed a one-sided correction procedure for Fisher's method to avoid detection of discordant effect sizes in meta-analysis [28]. The null distribution of the new statistic, however, became difficult to derive. The approach can potentially be extended to other methods, and more research will be needed on this issue.

Figure 3 The histograms of the true number of DE studies among detected DE genes under FDR = 5% in each method.

Chang et al. BMC Bioinformatics 2013, 14:368
http://www.biomedcentral.com/1471-2105/14/368
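To make the contrast between the hypothesis settings and the direction issue concrete, the sketch below implements textbook forms of five p-value combination rules using only the Python standard library. These are common formulations and are not guaranteed to match the exact implementations evaluated in this paper; the function names are hypothetical, and the Stouffer variant shown takes one-sided p-values as an assumption about how direction is retained.

```python
import math
from statistics import NormalDist


def fisher(p):
    """Fisher: -2 * sum(log p) follows a chi-square with 2K df under the null."""
    k = len(p)
    half = -sum(math.log(x) for x in p)  # half of the Fisher statistic
    # The chi-square survival function has a closed form for even df 2K.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def stouffer(p_one_sided):
    """Stouffer: sum of z-scores divided by sqrt(K); one-sided inputs keep direction."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - x) for x in p_one_sided) / math.sqrt(len(p_one_sided))
    return 1 - nd.cdf(z)


def min_p(p):
    """minP: the smallest p-value; its null distribution is Beta(1, K)."""
    return 1 - (1 - min(p)) ** len(p)


def max_p(p):
    """maxP: the largest p-value; its null distribution is Beta(K, 1)."""
    return max(p) ** len(p)


def r_op(p, r):
    """rOP: the r-th smallest p-value; its null distribution is Beta(r, K - r + 1),
    evaluated here through the equivalent binomial tail sum."""
    k, x = len(p), sorted(p)[r - 1]
    return sum(math.comb(k, i) * x ** i * (1 - x) ** (k - i) for i in range(r, k + 1))


if __name__ == "__main__":
    one_study = [0.001, 0.6, 0.7, 0.8, 0.9]  # strong signal in a single study
    print("minP", min_p(one_study), "maxP", max_p(one_study), "Fisher", fisher(one_study))
    # Opposite directions: z = +3 in one study and z = -3 in another.
    tail = 1 - NormalDist().cdf(3.0)
    print("two-sided Fisher", fisher([2 * tail, 2 * tail]))  # direction is invisible
    print("one-sided Stouffer", stouffer([tail, 1 - tail]))  # the two signals cancel
```

In the first example minP and Fisher yield small combined p-values while maxP does not, mirroring the HS_B versus HS_A behaviour in Table 1. In the second, the two-sided p-values are identical for up- and down-regulation, so any rule fed with them cannot see the discordance.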

Results of the four evaluation criteria 
Detection capability 

Figure 2 shows the number of DE genes identified by each of the 12 meta-analysis methods (FDR = 10% for MDD and breast cancer due to their weak signals and FDR = 1% for all the others). Each plot shows the mean with standard error bars for 50 bootstrapped data sets. Additional file 1: Table S4 shows the MSR and ASR for each method in the six examples. The methods in Figure 2 are ordered according to their ASR values. The top six methods with the strongest detection capability were those characterized in Table 1 as detecting HS_B-type DE genes: Fisher, AW, Stouffer, minP, FEM and RankSum. The order of performance of these six methods was fairly consistent across all six examples. The next four methods were rOP, RankProd, maxP and REM, most of which target either HS_r or HS_A. PR and SR had the weakest detection capability, which was consistent with the simulation result in Table 1.

Figure 4 Plots of mean values of -log10(p) with error bars of standard error from KS-test based on the top 100 surrogate pathways. Note that FEM, REM, RankProd and RankSum cannot be applied to survival examples.

Biological association 

Figure 4 shows plots of the mean with standard error bars of the pathway association p-values (minus log-transformed) of the top 100 surrogate disease-related pathways for the 12 methods. Additional file 1: Table S5 shows the corresponding MSR and ASR. We found that Stouffer, Fisher and AW had the best performance among the 12 methods. Surprisingly, although PR and SR had low detection capability in simulation and real data, they consistently had relatively high biological association results. This may be due to the better DE gene ordering these two methods provide, as also shown by their high AUC values under both hypothesis settings in the simulation.

Stability 

Figure 5 shows the plots of the mean with standard error bars of stability calculated by the adjusted DE similarity measure. Additional file 1: Table S6 contains the corresponding MSR and ASR. In summary, RankProd and RankSum were the most stable meta-analysis methods, probably because these two nonparametric approaches take into account all possible fold change calculations between cases and controls. They do not need any distributional assumptions, which provided stability even when sample sizes were small [29]. The maximum p-value (maxP) method consistently had the lowest stability in all data sets, which is somewhat expected. For a given candidate marker with a small maximum p-value, the chance that at least one study has a significantly inflated p-value is high when the sample size is reduced by half. The stability measures in the breast cancer example were generally lower than in the other examples. This is mainly due to the weak signals for survival outcome association, which might be improved if larger sample sizes were available.

Figure 5 Plots of mean with error bars of standard error of stability in six examples based on the adjusted similarity between DE results of two randomly split data sets. Note that FEM, REM, RankProd and RankSum cannot be applied to survival examples.

Robustness 

Figure 6 shows the plots of the mean with standard error bars of robustness calculated by the adjusted DE similarity measure between the original meta-analysis and the new meta-analysis with an added outlier. Additional file 1: Table S7 shows the corresponding MSR and ASR values. In general, methods suitable for HS_B (minP, AW, Fisher and Stouffer) had better robustness than methods for HS_A or HS_r (e.g. maxP and rOP). The trend is consistent in the prostate cancer, brain cancer and IPF examples but is more variable in the weak-signal MDD and breast cancer examples. Surprisingly, RankSum was the method most sensitive to outliers, while RankProd performed reasonably well.

Characterization of methods by MDS plots 

We applied the adjusted DE similarity measure to quantify the similarity of the DE gene orders from any two meta-analysis methods. The resulting dissimilarity measure (i.e. one minus the adjusted similarity measure) was used to construct the multidimensional scaling (MDS) plot, showing the similarity/dissimilarity structure between the 12 methods in a two-dimensional space. When two methods were close to each other, they generated similar DE gene orderings. The patterns of the MDS plots from the six examples were quite consistent (Additional file 1: Figure S6). Figure 1(a) shows an aggregated MDS plot where the input dissimilarity matrix is averaged over the six examples. We clearly observed that Fisher, AW, Stouffer, minP, PR and SR were consistently clustered together in all six individual MDS plots and in the aggregated MDS plot (labeled in red). This is not surprising given that these methods all sum transformed p-value evidence across studies (except for minP). The two methods that combine effect sizes and the two methods that combine ranks (FEM, REM, RankProd and RankSum, labeled in blue) are consistently clustered together. Finally, the maxP and rOP methods seem to form a third, loose cluster (labeled in green).

Figure 6 Plots of mean with error bars of standard error of robustness in six examples based on the adjusted similarity between DE results with/without adding one irrelevant noise study. Note that FEM, REM, RankProd and RankSum cannot be applied to survival examples.
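As an illustration of how a dissimilarity matrix between methods can be turned into two-dimensional coordinates, the following is a minimal sketch of classical (Torgerson) MDS in pure Python. It is an assumption that classical MDS is an adequate stand-in here, since the text does not state which MDS variant was used; the function names are hypothetical, and the power iteration assumes the two leading eigenvalues of the centered matrix are positive.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def classical_mds(d, dims=2, iters=500):
    """Classical MDS of a symmetric dissimilarity matrix d (list of lists)."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -0.5 * (D^2 - row means - column means + grand mean).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        v = [math.sin(i + 1.0 + axis) for i in range(n)]  # deterministic start vector
        for _ in range(iters):  # power iteration towards the dominant eigenvector
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, _matvec(b, v)))  # Rayleigh quotient
        if lam > 0:
            for i in range(n):
                coords[i][axis] = v[i] * math.sqrt(lam)
        # Deflate B so the next pass converges to the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


def average_matrices(mats):
    """Element-wise mean of several dissimilarity matrices (e.g. one per data set)."""
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (3, 0), (0, 1), (3, 1)]
    d = [[math.dist(a, b) for b in pts] for a in pts]
    print(classical_mds(d))
```

An aggregated plot of the kind described above would pass `average_matrices([...])` of the per-example dissimilarity matrices to `classical_mds`.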

Characterization of data sets by entropy measure 

From the simulation study, the selection of the most suitable meta-analysis method depends on the hypothesis setting behind the methods. The choice of a hypothesis setting mostly depends on the biological purpose of the analysis; that is, whether one aims to detect candidate markers differentially expressed in "all" (HS_A), "most" (HS_r) or "one or more" (HS_B) studies. However, when no prior biological information or preference exists, the entropy measure can be used to choose the hypothesis setting objectively. The analysis identifies the top 1,000 genes from Fisher's meta-analysis method, and the gene-specific entropy of each gene is calculated. When the entropy is small, the p-values are small in only one or very few studies. Conversely, when the entropy is large, most or all of the studies have small p-values. Figure 1(b) shows the box-plots of entropy of the top 1,000 candidate genes identified by Fisher's method in the six data sets. The result shows that the prostate cancer example comparing primary and metastatic tumor samples had the smallest entropy values, which indicated high heterogeneity across the three studies and that HS_B should be considered in the meta-analysis. On the other hand, MDD had the highest entropy values. Although the signals of each MDD study were very weak, they were rather consistent across studies, and application of HS_A or HS_r was adequate. For the other examples, we suggest using the robust HS_r unless another prior biological purpose is indicated.
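The behaviour described above (small entropy when the evidence sits in one study, large entropy when it is spread over all studies) can be reproduced with the sketch below. The definition used here is an assumption made for illustration: each study's weight is its share of the total -log p-value, and the entropy is the Shannon entropy of those weights. The exact definition is given in the Methods section and may differ; the function name `gene_entropy` is hypothetical.

```python
import math


def gene_entropy(p_values):
    """Shannon entropy of one gene's evidence across studies (assumed definition).

    Each study contributes weight -log(p_k) / sum_j -log(p_j).
    """
    neg_log = [-math.log(p) for p in p_values]
    total = sum(neg_log)
    if total == 0:  # every p-value equals 1: no evidence in any study
        return 0.0
    weights = [x / total for x in neg_log]
    return -sum(w * math.log(w) for w in weights if w > 0)


if __name__ == "__main__":
    print(gene_entropy([1e-6, 0.5, 0.5]))     # evidence concentrated in one study
    print(gene_entropy([1e-3, 1e-3, 1e-3]))   # evidence shared equally: log(3)
```

Under this definition the entropy of a gene with K studies lies between 0 and log(K), so box-plots from meta-analyses with different numbers of studies are not on the same scale unless normalized.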

Conclusions and discussions 

An application guideline for practitioners 

From the simulation study, the 12 meta-analysis methods were categorized into three hypothesis settings (HS_A, HS_B and HS_r), showing their strengths for detecting different types of DE genes in the meta-analysis (Figure 3 and the second column of Table 2). For example, maxP is categorized as HS_A since it tends to detect only genes that are differentially expressed in all studies. From the results using the four evaluation criteria, we summarized the rank of ASR values (i.e. the order used in Figures 2 and 6) and calculated the rank sum of each method in Table 2. The methods were then sorted first by hypothesis setting category and then by rank sum. The clusters of methods from the MDS plot are also displayed. For methods in the HS_A category, we surprisingly see that the maxP method performed among the worst in all four evaluation criteria and should be avoided. PR was a better choice in this hypothesis setting although it provides rather weak detection capability. For HS_B, Fisher, AW and Stouffer performed very well in general. Among these three methods, we note that AW has the additional advantage of providing an adaptive weight index that indicates the subset of studies contributing to the meta-analysis and characterizes the heterogeneity (e.g. adaptive weight (1,0,...) indicates that the marker is DE in study 1 but not in study 2, etc.). As a result, we recommend AW over Fisher and Stouffer in the HS_B category. For HS_r, the result was less conclusive. REM provided better stability and robustness but sacrificed detection capability and biological association. On the other hand, rOP obtained better detection capability and biological association but was neither stable nor robust. In general, since detection capability and biological association are of more importance in the meta-analysis, and since rOP has the advantage of linking the choice of r in HS_r with the rOP method (e.g. when r = 0.7·K, we identify genes that are DE in more than 70% of studies), we recommend rOP over REM.

Table 2 Ranks of method performance in the four evaluation criteria

| Method | Targeted HS | Power detection | Biological association | Stability | Robustness | Rank Sum | MDS* |
|---|---|---|---|---|---|---|---|
| PR | HS_A | 12 | 4 | 4 | 6 | 26 | 1 |
| SR | HS_A | 11 | 6 | 9 | 7 | 33 | 1 |
| maxP | HS_A | 9 | 10 | 12 | 11 | 42 | 2 |
| rOP | HS_r | 7 | 5 | 10 | 10 | 32 | 2 |
| REM | HS_r | 10 | 11 | 5 | 8 | 34 | 3 |
| Fisher | HS_B | 1 | 2 | 3 | 3 | 9 | 1 |
| AW | HS_B | 2 | 3 | 6 | 2 | 13 | 1 |
| Stouffer | HS_B | 3 | 1 | 8 | 4 | 16 | 1 |
| minP | HS_B | 4 | 7 | 7 | 1 | 19 | 1 |
| RankProd | HS_B | 8 | 8 | 1 | 5 | 22 | 3 |
| RankSum | HS_B | 6 | 12 | 2 | 12 | 32 | 3 |
| FEM | HS_B | 5 | 9 | 11 | 9 | 34 | 3 |

*Cluster number of methods in the MDS plot (Figure 1(a)).
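The rank-sum bookkeeping behind Table 2 can be checked with a few lines of Python. The ranks below are transcribed from Table 2; the power-detection rank of AW is blank in the extracted table and is inferred here as 2 from its rank sum of 13, which is an assumption of this sketch. The names `RANKS`, `rank_sums` and `sort_methods` are hypothetical.

```python
# method: (targeted HS, power detection, biological association, stability, robustness)
RANKS = {
    "PR": ("HS_A", 12, 4, 4, 6),
    "SR": ("HS_A", 11, 6, 9, 7),
    "maxP": ("HS_A", 9, 10, 12, 11),
    "rOP": ("HS_r", 7, 5, 10, 10),
    "REM": ("HS_r", 10, 11, 5, 8),
    "Fisher": ("HS_B", 1, 2, 3, 3),
    "AW": ("HS_B", 2, 3, 6, 2),  # first rank inferred from the rank sum (assumption)
    "Stouffer": ("HS_B", 3, 1, 8, 4),
    "minP": ("HS_B", 4, 7, 7, 1),
    "RankProd": ("HS_B", 8, 8, 1, 5),
    "RankSum": ("HS_B", 6, 12, 2, 12),
    "FEM": ("HS_B", 5, 9, 11, 9),
}

HS_ORDER = ["HS_A", "HS_r", "HS_B"]  # category order used in Table 2


def rank_sums(ranks):
    """Sum of the four criterion ranks for each method (lower is better)."""
    return {method: sum(values[1:]) for method, values in ranks.items()}


def sort_methods(ranks):
    """Sort first by hypothesis setting category, then by rank sum."""
    sums = rank_sums(ranks)
    return sorted(ranks, key=lambda m: (HS_ORDER.index(ranks[m][0]), sums[m]))


def best_by_rank_sum(ranks):
    """Method with the lowest rank sum within each hypothesis setting."""
    best = {}
    for method in sort_methods(ranks):
        best.setdefault(ranks[method][0], method)
    return best


if __name__ == "__main__":
    print(rank_sums(RANKS))
    print(sort_methods(RANKS))
    print(best_by_rank_sum(RANKS))
```

By rank sum alone the leaders are PR, rOP and Fisher for HS_A, HS_r and HS_B respectively; the preference for AW over Fisher stated in the text rests on its adaptive weight index, which the rank sum does not capture.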

Below, we provide a general guideline for practitioners applying microarray meta-analysis. Data sets relevant to a biological or disease hypothesis are first identified, preprocessed and annotated according to Steps (i)-(v) in Ramasamy et al. Proper quality assessment should be performed to exclude studies of problematic quality (e.g. with the aid of MetaQC, as we did in the six examples). Based on the experimental design and biological objectives of the collected data, one should determine whether the meta-analysis aims to identify biomarkers differentially expressed in all studies (HS_A), in one or more studies (HS_B) or in the majority of studies (HS_r). In general, if higher heterogeneity is expected from, say, heterogeneous experimental protocols, cohorts or tissues, HS_B should be considered. For example, if the combined studies come from different tissues (e.g. the first study uses peripheral blood, the second study uses muscle tissue and so on), tissue-specific markers may be expected and HS_B should be applied. On the contrary, if the collected studies are relatively homogeneous (e.g. they use the same array platform or come from the same lab), HS_r is generally recommended, as it provides robustness and detects consistent signals across the majority of studies. When no prior knowledge is available to choose a hypothesis setting, or when the researcher is interested in a data-driven decision, the entropy measure in Figure 1(b) can be applied and the resulting box-plot can be compared to the six examples in this paper to guide the decision. Once the hypothesis setting is determined, a meta-analysis method can be selected based on the discussion above and Table 2.

Conclusions 

In this paper, we performed a comprehensive comparative study to evaluate 12 microarray meta-analysis methods using simulation and six real examples with four evaluation criteria. We clarified three hypothesis settings that were implicitly assumed behind the methods. The evaluation results produced a practical guideline to inform biologists of the best choice of method(s) in real applications.

With the reduced cost of high-throughput experiments, data from microarray, new sequencing techniques and mass spectrometry accumulate rapidly in the public domain. Integration of multiple data sets has become a routine approach to increase statistical power, reduce false positives and provide more robust and validated conclusions. The evaluation in this paper focuses on microarray meta-analysis, but the principles and messages apply to other types of genomic meta-analysis (e.g. GWAS, methylation, miRNA and eQTL). When next-generation sequencing technology becomes more affordable in the future, sequencing data will become more prevalent as well and similar meta-analysis techniques will apply. For these different types of genomic meta-analysis, similar comprehensive evaluations could be performed and application guidelines should be established as well.
Additional file 1: Supplementary methods of (1) Adaptive weighted (AW) Fisher, (2) Combined statistical estimates (effect size) methods [...] Product (RankProd) and Rank Sum (RankSum). Figure S1. MetaQC.